Final Project

Stephane Gurgenidze & Davit Chechelashvili

Table of Contents

  • About Problem
  • Task Details
  • Used Libraries
  • Dataset Preview
  • Dataset Columns
  • Get to Know Your Data
  • Creating Food Dataframe
  • Removing Outliers With IQR
  • Removing Outliers With Projection Method
  • Collating Food Entries
  • EDA (Exploratory Data Analysis)
  • Feature Engineering
  • Data Preparation for Clustering
  • Clustering With K-Means
  • Clustering With SOMPY
  • Usage
    About Problem

    We have a dataset from a food-tracking application that contains daily food-intake entries for each customer. There were many candidate problems that could be posed and solved on top of this dataset, but we chose the one we found (in our opinion) most interesting: we will divide the foods into clusters and analyse each cluster to detect what kinds of food are preferred by the users of this application.

    Task Details

    As we mentioned above, our main problem is food clustering, which will be solved using algorithms and techniques such as K-Means, PCA and SOMPY. But before clustering, the harder part of the problem is processing our data: parsing its columns, extracting the food entries, creating a unified food dataset, analysing, cleaning and filling it, preparing it for clustering, and so on.

    To not drag this out, let's get to the task at hand.

    Used Libraries

    Setup
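    The actual import cell is not shown in this export. A plausible minimal setup, assuming the standard PyData stack used throughout the notebook (pandas, NumPy, scikit-learn, matplotlib), would be:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
```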

    Dataset Preview

    Dataset Columns

    To explore our dataset's data types, let's take a random row and observe it.

    user_id

    date

    food_entries

    aggregate_intake
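    As a sketch of what inspecting a random row looks like, here is a hypothetical stand-in for the real dataframe (in the actual dataset, food_entries and aggregate_intake are stored as JSON-like strings that need parsing):

```python
import json
import pandas as pd

# Hypothetical stand-in for the real dataset.
df = pd.DataFrame({
    "user_id": [101, 102],
    "date": ["2020-01-01", "2020-01-02"],
    "food_entries": [
        json.dumps([{"name": "Apple", "calories": 52}]),
        json.dumps([{"name": "Egg", "calories": 78, "protein": 6}]),
    ],
    "aggregate_intake": [json.dumps({"calories": 52}), json.dumps({"calories": 78})],
})

row = df.sample(1).iloc[0]                 # take a random row
entries = json.loads(row["food_entries"])  # parse the stringified entry list
print(row["user_id"], row["date"], entries)
```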

    Get to Know Your Data

    Nutritions

    Let's define some useful functions to check how many different nutritions we have in our dataset.

    As we can see, we have 17 different nutritions, but not all of them are specified by users in every food entry. Let's find out exactly how many are specified for each food.

    From this check we can infer that the number of nutritions entered to define a food ranges from 1 to 6.

    We can see that almost 90% of foods are entered with 6 defining nutritions, which is quite nice, because more information is always good for us. Another 7% are entered with 5 defining nutritions and almost 4% with 4. The remaining counts don't even amount to 0.4 percent.

    Food

    Now let's define some useful functions to check how many different types of food we have in our dataset.

    Now, let's sort the foods by count and check the first and last 15.

    As we can notice, there are foods that are entered only once. Let's check how many such cases there are.

    Around 56% of unique foods are entered only once, which can be attributed to the fact that users enter food names on their own, so there may be many typos.
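    The check can be sketched with value_counts (toy food names below; in the notebook the series comes from the parsed entries):

```python
import pandas as pd

names = pd.Series(["Apple", "apple pei", "Egg", "Apple",
                   "Egg", "Apple", "banana", "Cheese sandwich"])

counts = names.value_counts()
once = (counts == 1).sum()       # unique foods that appear exactly once
share = once / len(counts)
print(f"{once} of {len(counts)} unique foods were entered once ({share:.0%})")
```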

    It would be interesting to find out how many of them are really entered only once. We might check it later.

    Creating Food Dataframe

    Let's create a dataframe specifically for foods, where each food will have its corresponding nutritions.

    Helper functions:

    Transforming to food dataframe:

    The following code creates the dataframe, filters out duplicates and adds their counts as a new column on the unique entries. This code is commented out because we save the resulting dataframe and load it instead of recreating it on every run.

    Now, let's filter out duplicates and also add the count of each unique entry as a new column.

    Finally, save our dataframe.

    And load it.

    Now that we have the food dataframe, we can check the distributions of nutritions in our data.

    It seems there are outliers that are spoiling the plots and data. They should be taken care of. For example:

    Those are definitely abnormal entries: some of them are typos, some of them are users not taking the app seriously. Either way, these entries should be removed because they are polluting our data.

    Removing Outliers With IQR

    To remove outliers we should get a better understanding of our data and its distribution. To do this we will use the interquartile range (IQR) method. The IQR is calculated as the difference between the 75th and the 25th percentiles of the data and defines the box in a box-and-whisker plot. It covers the middle 50% of the data, or the body of the data. The IQR can be used to identify outliers by defining limits on the sample values that are a factor k of the IQR below the 25th percentile or above the 75th percentile. The common value for the factor k is 1.5. A factor k of 3 or more can be used to identify extreme outliers, or "far outs" in the context of box-and-whisker plots.
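    A minimal sketch of the IQR fences described above (the column values are illustrative):

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Lower/upper IQR fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

calories = pd.Series([90, 110, 120, 150, 200, 250, 5000])  # 5000 looks like a typo
lo, hi = iqr_bounds(calories)
outliers = calories[(calories < lo) | (calories > hi)]
print(outliers.tolist())  # → [5000]
```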

    Helper functions:

    Applying IQR:

    Now that we have the Q1, IQR and Q3 scores for each nutrition column, we can take a closer look at what the IQR method considers to be outliers.

    331056 is almost 17% of our data, which is a bit too much. We will have to look at those ourselves and see what is going on. First, let's sort the indices by their corresponding values for each column and also build a new dict containing the values.

    Now that we have our data the way we want it, let's take calories as an example and look into the details.

    Calories:

    We can see that the IQR method reports 109 thousand outliers just from calories. Let's look at how they are distributed.

    The first thing that catches the eye is that there are foods with negative calories. Let's observe them more closely.

    We can see that these are not foods, so it is perfectly fine to remove such entries from our food database.

    We can also notice that the IQR method labels a food with 486 calories an outlier. But it is perfectly normal for a food to have 486 calories; for example, one large McDonald's French Fries is about 500 calories. So, as suspected, the IQR method doesn't filter outliers very well and we have to intervene manually. At this point it would be good to employ the help of a domain expert, but for now let's do this ourselves.

    After some research, whose details we shall spare you (it involved a lot of googling food compositions and comparing them to our data), it was decided that the maximal calories for one dish should be 1500, and everything above that is an outlier and should be removed.

    Now that we have the indices of elements that we acknowledge as calorie outliers, we can move on to the next nutritions.

    Carbs, Fat, Protein, Sodium, Sugar, Fiber, Potass., Iron, Calcium, Sat Fat, Chol, Vit A, Vit C, Trn Fat, Mon Fat, Ply Fat:

    We won't go through every step of the outlier-removal process for every nutrition; they are the same as for calories. The thresholds list contains the decided individual thresholds for the respective nutritions.
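    The per-nutrition cut-offs can then be applied in one pass. Only the 1500-calorie bound comes from the text above; the other values below are hypothetical placeholders for the real thresholds list:

```python
import pandas as pd

thresholds = {"calories": 1500, "carbs": 300, "fat": 150, "protein": 200}

food_df = pd.DataFrame({
    "calories": [250, 1600, -5],
    "carbs":    [30, 310, 0],
    "fat":      [10, 20, 0],
    "protein":  [8, 40, 0],
})

mask = pd.Series(False, index=food_df.index)
for col, hi in thresholds.items():
    # anything above the manual threshold or below zero is an outlier
    mask |= (food_df[col] > hi) | (food_df[col] < 0)

cleaned = food_df[~mask]
print(len(cleaned))  # → 1
```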

    Deleting outliers:

    From our observations it looks like 0.35% of our data were outliers; let's delete them from the dataframe.

    Let's double-check and double-filter our results with another method of detecting outliers: the projection method.

    Removing Outliers With Projection Method

    Projection methods are relatively simple to apply and quickly highlight extraneous values.

    We shall use PCA for reducing dimensionality. It will also help us identify outliers with a multivariate method.

    Standardizing the data:

    For PCA to function well, it is necessary that the data is standardized. It is even more important for our data because, aside from having different variances, not everything is measured in the same units, e.g. grams.
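    A sketch of the standardization step, assuming scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales, e.g. calories vs. grams of fat
X = np.array([[100.0, 5.0], [300.0, 1.0], [500.0, 9.0]])
X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```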

    Using PCA:

    Having only 25% of the variance retained is probably bad, but we don't know exactly how bad. So, we have 2 options:

    Let's go with the second option.
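    The PCA step can be sketched like this (random data stands in for the standardized nutrition matrix; note that even isotropic 8-dimensional noise retains roughly 2/8 = 25% of its variance in two components):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))   # stand-in for the standardized data
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)       # 2-D projection used for plotting
print(round(pca.explained_variance_ratio_.sum(), 2))
```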

    Visualizing data:

    Now let's draw some red lines for what we think are outliers. Everything outside any red line will be considered an outlier.

    Now, let's filter all points and remember outliers.

    Deleting outliers:

    At the end, we can take a look at removed outliers.

    Collating Food Entries

    As we can see, our dataset contains the same foods with different amounts of nutritions. This can be caused by different reasons:

    Because of these reasons, the majority of foods appear several times in our dataset. To treat all these issues we can simply merge all the different entries of the same food into one, with the logic that follows:

    Our goal is to not have zero values in any column; to achieve this we shall replace them where possible.
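    The merge logic can be sketched as a groupby that fills zeros from other entries of the same food (a simplified version of the real collation functions; column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "food":     ["Apple", "Apple", "Apple"],
    "calories": [52, 52, 0],    # one user skipped calories
    "fiber":    [0, 2.4, 2.4],  # another skipped fiber
})

def collate(group: pd.DataFrame) -> pd.Series:
    out = {}
    for col in group.columns:
        nonzero = group.loc[group[col] != 0, col]
        # replace zeros using the other entries of the same food, when possible
        out[col] = nonzero.mean() if len(nonzero) else 0.0
    return pd.Series(out)

collated = df.groupby("food")[["calories", "fiber"]].apply(collate)
print(collated)
```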

    To be honest, writing and running the above functions took us several days and we aren't going to run them ever again. So, we strictly decided to save the output as CSV and load it on every run. Running this function is not advisable, since it takes up to 2 hours of execution time, and the previous 15 versions took much more (:D).

    Finally, our data looks pretty clean. We reached our goal of having one unique entry for each food, consisting of the collated data of multiple entries. We can verify that it worked: we have three different portions of Coffee - Brewed from grounds, each of them unique, and their nutritions scale with portion size (the 2-cup portion's nutritions are twice the 1-cup portion's, and the 3-cup portion's are three times the 1-cup portion's). The same goes for almost every other entry.

    EDA (Exploratory Data Analysis)

    Now that we have our data the way we want it, it will be pretty interesting to explore the details and truly see what we have to work with. Let's start with univariate analysis.

    Univariate Analysis

    Calcium:

    This plot looks pretty bad; let's check why that is.

    As we can see, Calcium has a non-zero value in only 2.7% of our data, which explains the plot's look. Let's plot only the non-zero values.

    We can see that in the vast majority of cases where Calcium is not 0 it is less than 10, while its range is from 0 to 106.

    Ply Fat:

    It looks even worse than the case of Calcium; let's look at it more closely.

    We can see that Ply Fat has a non-zero value in only 0.02% of our data, which is terribly low. Let's plot only the non-zero values.

    We can see that the most common value for Ply Fat, when it is not 0, is 1, and its range varies from 1 to 13.

    Carbs:

    Well, this plot looks different from the cases above; let's get into more detail.

    As we can see, we have a Carbs value for 4/5 of our foods. Let's see more of the plots.

    We can see that 75% of foods have a Carbs value of less than 25.

    Chol:

    It looks like a similar case to Calcium.

    As we can see, Chol has a non-zero value in only 1.4% of our data, which explains the plot's look. Let's plot only the non-zero values.

    We can see that in the vast majority of cases where Chol is not 0 it is less than 50, and it varies from 0 to 648.

    Fiber:

    This plot is different from every plot above; let's investigate more.

    The above plot can be explained by the fact that Fiber is non-zero in 1/4 of our data, so it still contains 75% zeroes, which is still a high number. Let's plot the non-zero entries.

    We can see that in the vast majority of cases where Fiber is not 0 it has a value from 0 to 5, while its range varies from 0 to 39.

    Protein:

    It looks like a similar case to Carbs.

    Our supposition above was correct: Protein is entered as non-zero in almost 4/5 of cases, just like Carbs. Let's see more of the plots.

    The plots also look like the Carbs plots, but the numbers are different: where 75% of foods had a Carbs value of less than 25, here 75% of foods have a Protein value of less than 12.

    Calories:

    We already knew that Calories would have the best plot, because it is the most-entered nutrition across all foods. Let's check the percentage of entries.

    As suspected, almost every food has a Calories value; only 1.65% don't. Let's see the distributions.

    We can see that the vast majority of entered foods (83%) have a Calories value of less than 300, and the mean is 176 Calories.

    Trn Fat:

    This plot looks pretty bad; let's check why that is.

    As we can see, Trn Fat has a non-zero value in only 0.01% of our data, which explains the plot's look. Let's plot only the non-zero values.

    We can see that in the majority of cases (61%) where Trn Fat is not 0 it is equal to 1.

    Fat:

    Let's check the non-zero count of Fat.

    As we can see, Fat has a non-zero value in 70% of our data, which explains the plot's look. Let's look more into the distributions.

    We can see that 4/5 of entered foods have a Fat value of less than 12, with a mean of 6.7.

    Potass.:

    This plot looks like the case of Calcium; let's check it out.

    As we can see, Potass. has a non-zero value in only 1.5% of our data, which explains the plot's look. Let's plot only the non-zero values.

    We can see that most foods have a Potass. value of less than 500 when it is non-zero, with a mean of 269.

    Iron:

    Another one that looks like the case of Calcium.

    As we can see, Iron has a non-zero value in only 3.5% of our data, which explains the plot's look. Let's plot only the non-zero values.

    We can see that in the vast majority of cases where Iron is not 0 it is less than 10.

    Sat Fat:

    Yet another case like Calcium.

    As we can see, Sat Fat has a non-zero value in only 1.9% of our data, which explains the plot's look. Let's plot only the non-zero values.

    We can see that in the vast majority of cases where Sat Fat is not 0 it is less than 7.

    Vit A:

    This plot looks like the case of Ply Fat; let's check the similarities.

    As we can see, Vit A has a non-zero value in only 0.07% of our data, which explains the plot's look and the similarity to Ply Fat, which had 0.02% non-zero data. Let's plot only the non-zero values.

    We can see that in almost 90% of cases where Vit A is not 0 it is in the 0-50 range.

    Sodium:

    Let's check the non-zero percentage of Sodium.

    As we can see, Sodium has a non-zero value in 55% of our data, which is not bad. Let's draw both plots: non-zero only and the whole data.

    Now let's compare means.

    It is clear that most Sodium values are under the 500 mark, with a mean of 329, which drops to 180 when we count the zeroes in as well.

    Vit C:

    This plot looks almost the same as the Ply Fat case.

    As we can see, Vit C has a non-zero value in only 0.12% of our data, which explains the plot's look. Let's plot only the non-zero values.

    We can see that in the vast majority of cases where Vit C is not 0 it is less than 25.

    Sugar:

    This plot looks like the case of Sodium.

    As we can see, Sugar has a non-zero value in only 41% of our data, which explains the plot's look. Let's draw both plots: non-zero only and the whole data.

    Now let's compare means.

    It is clear that most Sugar values are under the 15 mark, with a mean of 9, which drops to 3.7 when we count the zeroes in as well.

    Mon Fat:

    This looks like yet another case of Ply Fat; let's check it out.

    As we can see, Mon Fat has a non-zero value in only 0.014% of our data, which explains the plot's look. Let's plot only the non-zero values.

    We can see that in the vast majority of cases where Mon Fat is not 0 it is less than or equal to 5. Its mean is almost 2.5.

    After all this, let's build the correlation matrix to see the correlations between nutritions.

    Correlations:

    We will use the heatmap to visualize the correlations between the variables.
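    A sketch of the computation (synthetic data; the real notebook plots food_df's nutrition columns, possibly with seaborn rather than the bare matplotlib used here):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
cal = rng.uniform(0, 500, 300)
df = pd.DataFrame({
    "Calories": cal,
    "Fat": cal / 9 + rng.normal(0, 3, 300),  # fat carries ~9 kcal per gram
    "Sugar": rng.uniform(0, 30, 300),
})

corr = df.corr()
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar()
plt.savefig("corr_heatmap.png")
print(corr.round(2))
```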

    Looking at the heatmap we can notice that Calories correlates with Fat, Carbs, Protein and Sodium. Sugar and Carbs are also correlated with each other. These correlations are probably partly incurred by the fact that Calories and the nutrients correlated with it are the most "popular" ones, so in the majority of cases these nutritions are entered while the others are mostly zeroes. But high correlations like Calories-Fat or Calories-Carbs cannot be explained by that alone; they are probably truly correlated.

    Feature Engineering

    Choosing clustering variables:

    As we have seen above, some nutritions are entered very rarely. We can again look at the nutritions' sorted percentages.

    As we can see, 10 nutritions are present in less than 4% of food entries, while the others are present in more than 25%. Since we are going to cluster foods, these sparse nutritions would pollute the clusters, because they carry very little information. So we should not include them in the clustering variables, which also reduces the dimensionality of the clustering data.

    Creating new features:

    Food Heaviness

    Another description that can be applied to food is how heavy (rich/filling) it is. That can be determined by three nutritions: Calories, Fat and Sugar, so we can take their sum as a measure of heaviness. But there is one problem: their ranges and variances differ greatly. For example, a food with (0 Calories, 0 Fat, 20 Sugar) would come out far heavier than a food with (100 Calories, 0 Fat, 0 Sugar). To solve this, we normalize each column and then take the sum, so the range of Food Heaviness will be [0, 3].
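    The construction described above, as a sketch (toy values; each column is min-max normalized to [0, 1], so the sum lies in [0, 3]):

```python
import pandas as pd

df = pd.DataFrame({
    "Calories": [52, 500, 1200],
    "Fat":      [0.2, 25, 60],
    "Sugar":    [10, 30, 5],
})

df["Food Heaviness"] = sum(
    (df[c] - df[c].min()) / (df[c].max() - df[c].min())  # min-max normalize
    for c in ["Calories", "Fat", "Sugar"]
)
print(df["Food Heaviness"].round(2).tolist())  # → [0.2, 1.8, 2.0]
```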

    Now let's analyse our new feature.

    From the above plots we can infer that customers mostly eat light foods, which was probably to be expected given the nature of the process by which our data was gathered. Food heaviness mostly lies between 0.05 and 0.3, while its range is [0, 2.54].

    We will use this feature as another clustering variable.

    Calories Category

    Since Calories is the most important food-defining feature for several reasons:

    We shall add Calories Category as a new feature.

    Foods shall be categorized into 4 categories - low, low-average, high-average and high - denoted by the numbers 0 to 3 respectively.
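    The binning can be sketched with pd.cut (the bin edges below are hypothetical; the notebook's actual cut-offs may differ):

```python
import pandas as pd

calories = pd.Series([40, 150, 330, 900])
bins = [0, 100, 250, 500, float("inf")]   # hypothetical category boundaries
labels = [0, 1, 2, 3]                     # low, low-average, high-average, high
calorie_category = pd.cut(calories, bins=bins, labels=labels).astype(int)
print(calorie_category.tolist())  # → [0, 1, 2, 3]
```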

    Now let's get to analysing our new feature.

    Now, let's try multivariate analysis with Calorie Category.

    Calorie Category and Carbs

    Correlation between Calories and Carbs is pretty clear from this plot. Let's look at means.

    Calorie Category and Fat

    Correlation between Calories and Fat is pretty clear from this plot. Let's look at means.

    Calorie Category and Protein

    Correlation between Calories and Protein is pretty clear from this plot. Let's look at means.

    These plots are a more detailed visual representation of what the correlation table told us. We can infer that the calorie categories apply to these 3 nutritions as well: for example, a high-calorie food most likely has high protein too.

    We will use this feature as another clustering variable.

    Data Preparation for Clustering

    Creating clustering dataframe:

    Let's create new dataframe that consists of only clustering variables.

    Reducing skewness:

    1. For left skewness, we take squares, cubes or higher powers;
    2. For right skewness, we take roots, logarithms or reciprocals (roots are weakest).

    Skewness formula:

    Helper functions:

    Reducing skewness for every feature:
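    The two rules above can be sketched as a single helper that transforms against the direction of skew (log1p and squaring are just one choice from the families listed above):

```python
import numpy as np
import pandas as pd

def reduce_skew(s: pd.Series) -> pd.Series:
    """Transform a column against the direction of its skew."""
    if s.skew() > 0:        # right-skewed: compress the long right tail
        return np.log1p(s)  # log1p is safe for the many zero values
    if s.skew() < 0:        # left-skewed: stretch with a power
        return s ** 2
    return s

x = pd.Series(np.exp(np.random.default_rng(2).normal(size=1000)))  # right-skewed
print(round(x.skew(), 2), round(reduce_skew(x).skew(), 2))
```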

    Normalization:

    We have only numerical variables in our dataset, but they have different ranges, so we have to normalize them. Normalization is the second essential preprocessing step for clustering algorithms, since they are based on minimizing the Euclidean distance between data points and cluster centroids, which is sensitive to differences in the magnitude or scale of the features.
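    A sketch of the normalization step, assuming scikit-learn's MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# e.g. Calories vs. Food Heaviness: very different ranges
X = np.array([[36.0, 0.04], [386.0, 0.46], [149.0, 0.25]])
X_norm = MinMaxScaler().fit_transform(X)  # each column rescaled to [0, 1]
print(X_norm.min(axis=0), X_norm.max(axis=0))
```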

    Clustering With K-Means

    Let's run K-Means 20 times with 1-20 clusters and check the total costs. This elbow-style check will help us determine the best number of clusters.
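    The elbow check can be sketched as follows (toy blobs stand in for the normalized clustering dataframe; the cost is KMeans' inertia_):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# three well-separated blobs standing in for the real clustering data
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 5, 10)])

costs = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    costs.append(km.inertia_)  # total within-cluster sum of squares

print([round(c, 1) for c in costs])  # drops sharply until k=3, then flattens
```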

    Using PCA for Plotting Clusters

    Cluster Analysis

    First of all, let's look at how many elements are in each cluster and their percentage distribution.

    From this plot it can be seen that the number of elements in each cluster varies from 70000 to 200000, which is quite good.

    Feature distributions in clusters:

    Calcium:

    Ply Fat:

    Carbs:

    Chol:

    Fiber:

    Protein:

    Calories:

    Trn Fat:

    Fat:

    Potass.:

    Iron:

    Sat Fat:

    Vit A:

    Sodium:

    Vit C:

    Sugar:

    Mon Fat:

    Food Heaviness:

    Calorie Category:

    Each cell shows: level; mean; N-rank of that mean across the 12 clusters (N1 = highest).

    Cluster | Carbs | Fiber | Protein | Calories | Fat | Sodium | Sugar | Food Heaviness | Calorie Category
    0 | Low; 5.8; N8 | Average; 0.75; N4 | Low; 1.77; N10 | Low; 36; N10 | Low; 0.88; N11 | Low; 1; N9 | Average; 2.22; N6 | Low; 0.04; N10 | Low
    1 | High; 37.8; N3 | High; 4.48; N1 | Average; 15.7; N5 | High; 347; N3 | High; 13.8; N3 | High; 650; N2 | Average; 2.76; N5 | High; 0.36; N3 | High
    2 | Average; 28; N4 | Average; 0.49; N6 | Low; 2.3; N8 | Average; 149; N7 | Low; 2.42; N9 | Low; 9.98; N6 | High; 20.7; N1 | Average; 0.25; N4 | Average
    3 | Average; 18.4; N5 | High; 1.79; N3 | Low; 4.24; N7 | Average; 120; N9 | Low; 3.2; N8 | Low; 0.78; N10 | Low; 0.76; N8 | Average; 0.11; N9 | Average
    4 | Low; 1.8; N10 | Low; 0.04; N10 | High; 21.2; N2 | Average; 188; N5 | Average; 9.62; N4 | Low; 0.11; N12 | Low; 0.3; N9 | Average; 0.2; N5 | Average
    5 | Low; 1.3; N11 | Low; 0.05; N8 | High; 21.3; N1 | Average; 185; N6 | Average; 9.36; N5 | Average; 346; N3 | Low; 0.18; N11 | Average; 0.2; N6 | Average
    6 | Low; 1.3; N12 | Low; 0.03; N11 | Low; 0.14; N11 | Low; 13; N12 | Low; 0.47; N12 | Low; 1.31; N8 | Low; 0.17; N12 | Low; 0.01; N12 | Low
    7 | High; 42; N2 | Low; 0.03; N12 | Average; 17; N3 | High; 386; N2 | High; 16.4; N1 | High; 703; N1 | High; 10.3; N2 | High; 0.46; N1 | High
    8 | Low; 7.2; N7 | Low; 0.05; N9 | Low; 0.09; N12 | Average; 223; N4 | Low; 3.53; N7 | Low; 1.33; N7 | Low; 0.19; N10 | Average; 0.18; N7 | Average
    9 | Low; 3.8; N9 | Average; 0.28; N7 | Low; 2.21; N9 | Low; 34; N11 | Low; 1.28; N10 | Average; 147; N5 | Average; 1.22; N7 | Low; 0.04; N11 | Low
    10 | Average; 15.7; N6 | Average; 0.55; N5 | Low; 4.48; N6 | Average; 120; N8 | Low; 4.23; N6 | Average; 222; N4 | Average; 3.49; N4 | Average; 0.13; N8 | Average
    11 | High; 42; N1 | High; 2.34; N2 | Average; 16.8; N4 | High; 386; N1 | High; 16.1; N2 | Low; 0.25; N11 | Average; 4.2; N3 | High; 0.42; N2 | High
    Let's start with the most notable clusters.

    The most notable one is probably Cluster 7: it has low fiber, average protein, and everything else high. This cluster would be for unhealthy, heavy food. The first example that comes to mind is Snickers, and after checking its composition it is clear that Snickers is an average representative of this cluster. Pizzas and shawarmas would belong here too.

    Next is Cluster 11. It is also pretty heavy, but it has low sodium and not-so-high sugar, while having high fiber and carbs. Low sodium means there isn't much smoked, cured, salted or canned meat, and no pizzas, burritos or shawarmas; combined with the high fiber and carbs, this points to dough-based food like Khachapuri and so on.

    Cluster 1 differs notably from Cluster 11 only in its high sodium average, while being slightly less heavy. This would be the ideal cluster for sandwiches and the like.

    Moving past the high-calorie clusters, the first thing that strikes the eye is Cluster 2, with its abnormally high sugar average. This cluster would definitely be for sugary sweets, and the other stats further encourage this statement. Heavy sweets like Snickers would still go to Cluster 7, though.

    Cluster 5 would be the ideal cluster for Mtsvadi or eggs: low carbs and fiber, high protein, higher-than-average sodium and calories, and average heaviness.

    Cluster 3 would be the ideal cluster for pastas, rice and the like, because of its high fiber, upper-end carbs, and average calories and heaviness.

    Cluster 6 is definitely for drinks and the lightest of foods, like water, coffee or strict diet foods/vegetables.

    Cluster 10 looks more like fruits; a banana would be an ideal representative of this cluster.

    Cluster 0 is definitely for porridges, cereals, berries and similar light diet food.

    Cluster 4 would be for chicken breast, cheese, yogurt and other healthy foods with high protein and average calorie values.

    As for Clusters 8 and 9, they would still be for different kinds of fruits and diet foods.

    Clustering With SOMPY

    To reinforce our results, we can use another clustering method via SOMPY, a Python implementation of the self-organizing map (SOM): a type of artificial neural network trained with unsupervised learning to produce a two-dimensional, discretized representation of the input space of the training samples, called a map.
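    SOMPY itself is imported below; as an illustration of what the map training does (this is a minimal NumPy sketch of the SOM update rule, not SOMPY's actual API):

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.random((500, 3))             # stand-in for normalized nutrition vectors
rows, cols = 5, 5
weights = rng.random((rows * cols, 3))  # one prototype vector per map cell
grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)

for epoch in range(20):
    sigma = 2.0 * np.exp(-epoch / 10)   # shrinking neighbourhood radius
    lr = 0.5 * np.exp(-epoch / 10)      # decaying learning rate
    for x in data:
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
        d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)         # grid distance to BMU
        h = np.exp(-d2 / (2 * sigma ** 2))                 # neighbourhood function
        weights += lr * h[:, None] * (x - weights)         # pull neighbours toward x

# quantization error: mean distance from each sample to its BMU
qe = np.mean([np.min(np.linalg.norm(weights - x, axis=1)) for x in data])
print(round(qe, 3))
```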

    Import libraries:

    Creating models:

    At first, let's define some useful variables:

    And now let's train the model with different parameters - the more, the better. Each iteration is stored on disk for further study. For demonstration purposes we created only 20 models, since building each one takes too much time, so this code is commented out.

    Let's study the trained models and plot the errors obtained in order to select the best one.

    We need to minimize the topographic error, so we will manually choose the model with the lowest topographic error.

    Results:

    The components map shows the values of the variables for each prototype and lets us extract conclusions about non-linear patterns between variables. It shows the patterns learned by the neural network, which are used to determine the winning neuron for each training instance.

    Hits-map:

    This visualization is very important because it shows how the instances are spread across the hexagonal lattice. The more instances that fall into a cell, the more instances it represents and hence the more we have to take it into account.

    Clustering:

    Now, let's try to divide our data into 12 clusters. This visualization helps us to focus on the groups which share similar characteristics.

    Cluster Analysis

    We can make use of the K-Means clusters to simplify the process of explaining SOMPY's clusters. For example, looking at the components map described above, we can notice that Calories are gathered mostly in the upper-right corner. Applying similar logic to every nutrition, we can conclude that, for example, SOMPY's 10th, 7th and 2nd clusters are similar to K-Means' 11th, 7th and 1st clusters respectively. In addition, the 6th clusters are the same in both cluster sets, since they have the lowest values for every nutrition. Observing mainly Protein, along with the other components, we can infer that the 3rd SOMPY cluster is similar to the 5th K-Means cluster. The 4th (SOMPY) and 9th (K-Means) clusters are also similar, as are the 11th (SOMPY) and 2nd (K-Means), the 5th (SOMPY) and 3rd (K-Means), and so on.

    Usage

    After clustering the foods we can start talking about uses for these clusters. As we mentioned at the start of this project, there are many ways to use food clusters; they can be useful both for app development and for marketing. Let's briefly overview several of our usage ideas:

    The first use of our clusters could be improving the service with personalized suggestions. For example, if someone is taking in more Calories than their goal, we could suggest food from the category closest to their intake but with lower Calories content.
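    As a sketch of this idea (toy table and a hypothetical helper; in practice the cluster labels come from the K-Means step above):

```python
import pandas as pd

foods = pd.DataFrame({
    "food":     ["Snickers", "Granola bar", "Apple", "Rice"],
    "calories": [488, 230, 52, 130],
    "cluster":  [7, 7, 0, 3],   # labels produced by the clustering step
})

def suggest_lighter(food_name: str) -> str:
    """Suggest a similar food from the same cluster with fewer calories
    (hypothetical helper; picks the closest calorie value below the original)."""
    row = foods.set_index("food").loc[food_name]
    same = foods[(foods["cluster"] == row["cluster"]) &
                 (foods["calories"] < row["calories"])]
    return same.sort_values("calories").iloc[-1]["food"] if len(same) else food_name

print(suggest_lighter("Snickers"))  # → Granola bar
```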

    As we know, in modern society a lot of people have trouble starting a diet from scratch. To address that issue, we can build a diet plan that starts from the user's current Food Heaviness level and reduces it gradually, suggesting food from different categories.

    Another use could be personalized advertisements. For example, if we know a user tends to eat food from the 11th category (K-Means), we can show them a McDonald's ad (we probably aren't supposed to do that in a healthy-lifestyle app, but it's fine as an example). Also, if a user eats daily from some category, we can advertise other foods from that category.

    ------------------------------------------------ THE END ------------------------------------------------